Basic Analysis

Type : Type of transaction made

Days for shipping (real) : Actual shipping days of the purchased product

Days for shipment (scheduled) : Days of scheduled delivery of the purchased product

Benefit per order : Earnings per order

Sales per customer : Total sales per customer made per customer

Delivery Status : Delivery status of orders: Advance shipping , Late delivery , Shipping canceled , Shipping on tim...

Late_delivery_risk : Categorical variable that indicates if sending is late (1), it is not late (0).

Category Id : Product category code

Category Name : Description of the product category

Customer City : City where the customer made the purchase

Customer Country : Country where the customer made the purchase

Customer Email : Customer's email

Customer Fname : Customer name

Customer Id : Customer ID

Customer Lname : Customer lastname

Customer Password : Masked customer key

Customer Segment : Types of Customers: Consumer , Corporate , Home Office

Customer State : State to which the store where the purchase is registered belongs

Customer Street : Street to which the store where the purchase is registered belongs

Customer Zipcode : Customer Zipcode

Department Id : Department code of store

Department Name : Department name of store

Latitude : Latitude corresponding to location of store

Longitude : Longitude corresponding to location of store

Market : Market to where the order is delivered : Africa , Europe , LATAM , Pacific Asia , USCA

Order City : Destination city of the order

Order Country : Destination country of the order

Order Customer Id : Customer order code

order date (DateOrders) : Date on which the order is made

Order Id : Order code

Order Item Cardprod Id : Product code generated through the RFID reader

Order Item Discount : Order item discount value

Order Item Discount Rate : Order item discount percentage

Order Item Id : Order item code

Order Item Product Price : Price of products without discount

Order Item Profit Ratio : Order Item Profit Ratio

Order Item Quantity : Number of products per order

Sales : Value in sales

Order Item Total : Total amount per order

Order Profit Per Order : Order Profit Per Order

Order Region : Region of the world where the order is delivered : Southeast Asia ,South Asia ,Oceania ,Eastern ...

Order State : State of the region where the order is delivered

Order Status : Order Status : COMPLETE , PENDING , CLOSED , PENDING_PAYMENT ,CANCELED , PROCESSING ,SUSPECTED_FR...

Product Card Id : Product code

Product Category Id : Product category code

Product Description : Product Description

Product Image : Link of visit and purchase of the product

Product Name : Product Name

Product Price : Product Price

Product Status : Status of the product stock :If it is 1 not available , 0 the product is available

Shipping date (DateOrders) : Exact date and time of shipment

Shipping Mode : The following shipping modes are presented : Standard Class , First Class , Second Class , Same D...

Customer Zipcode: 3 missing values.

Order Zipcode: 155,679 missing values; not critical due to other location features.

Product Description: All values are missing; drop the column.

Customer LName: 8 missing values; not significant since Customer Id is used.

All other columns have no missing values

Data Cleaning and Preparation

Outliers analysis

Feature Engineering

Replace 'Order_Country' name with more recognizable name

Change Datatypes

Visualizations

1) Sales by Category

2) Sales Trend Over Date

3) Order Count by Category

4) Country-wise Profit Analysis (Geographical Analysis)

Highlight countries from where orders were placed according to the Total Profit

5) Revenue by Market and Region

6) Loss by Product Category and Region

Loss may be due to late deliveries and fraud.

7) Top 10 Products and Categories with Late Deliveries

8) Which Shipment Type is used usually when there is a Late Delivery

9) 'SUSPECTED_FRAUD' Order Status Analysis

10) Which Customers are doing most Frauds?











Correlation

Numeric Correlation

We can see that 'Order_Item_Product_Price' and 'Product_Price' are same.

Multi-collinearity

Multicollinearity:

Multicollinearity occurs when two or more predictor(Independent) variables in a regression model are highly correlated. This can lead to several problems:

Technique-1 (Linear/Logistic Regression)

  1. Just perform linear regression or logistic regression and note down the weights.

  2. Add some noise in the data like 0.01 or 0.1 to each value in the columns, then perform linear or logistic regression and check the weights.

Features weights which has changed drastically(Compare with previously noted weights) are the rows facing multicollinearity.Baically, weights represent the features. So, check the weights and remove the features which have large changes.

Technique-2 (Variance Inflation Factor)

We check the VIF values and after that we have couple of options that we can use for Handling Multicolinearity:

  1. Remove one variable
  2. Combine correlated variables
  3. Opt for PCA (Principle Component Analysis)
  4. We can ignore if the correlation is not that extreme based on our business used case.

Technique-3 (Lasso Regression)

Lasso (Least Absolute Shrinkage and Selection Operator) regression tackles multicollinearity by shrinking the coefficients of some features towards zero. It achieves this by adding a penalty term to the cost function that penalizes the absolute value of the coefficients. Here's how it helps:

Technique-4 (Ridge Regression)

Ridge regression, on the other hand, addresses multicollinearity by shrinking the coefficients, but it uses a penalty term based on the squared value of the coefficients (L2 norm). Compared to Lasso:

Choosing Between Lasso and Ridge Regression:

The choice between Lasso and Ridge regression depends on your specific goals:

Here are some additional points to consider:

Both techniques require tuning a hyperparameter (lambda) that controls the amount of shrinkage. You can evaluate the performance of models with Lasso and Ridge regression using techniques like cross-validation to see which one performs better for your data and goals.

Several of columns have vif values above 5 and some of them are very high. Such high value usually indicate that value in one column can be derived from values in other columns. Lets look at the data and see if we can find the relation.

When you do some basic math you will find these below relationship among these columns:-

These are perfect multi colinear features that shouldn't be passed on to regression models because it affects the performance of the model. You can drop one of the features from each of the above relationship.

So we can drop some of the columns to avoid multicollinearity





Inventory Optimization

Inventory Optimization for most IN-Demand Product

1. Minimum Order Quantity (MOQ): Unrelated for our analysis

Minimum order quantity is the fewest number of units a business is willing to sell to a single customer at once. While a retail store may be happy to sell a single t-shirt or one head of lettuce, it isn’t usually profitable to sell a single unit. They may require a minimum order of hundreds or thousands of units, depending on the product.

Many manufacturers sell in bulk to wholesalers, who then sell in bulk to retailers using an MOQ. Then individuals can go to the store to buy just one or two of a product.

Concept: Identify an optimal minimum order quantity for each product to minimize ordering costs and holding costs.

Data Analysis:

2) Safety Stock and Reorder Point

Safety_Stock referal : https://www.netsuite.com/portal/resource/articles/inventory-management/safety-stock.shtml

Safety Stock:

Safety stock is the extra inventory a company keeps on hand to buffer against uncertainties in demand or supply.

Reorder Point:

Reorder point is the inventory level at which a new order should be placed to replenish stock before it runs out.

The reorder point is calculated based on the average daily sales and the safety stock level. It ensures that there is enough stock on hand to cover the lead time (the time it takes for the new order to arrive) and the safety stock level to prevent stockouts during unexpected fluctuations in demand or delays in replenishment.

The time it takes for a new order to arrive, also known as lead time, can vary depending on factors such as supplier location, shipping method, and order processing time. To calculate the lead time, you typically need historical data on past orders to determine the average time it takes from the moment an order is placed until the products are received and available for sale.









Days for Shipping Prediction

Data Preparation

Dropping because

Customer Id and Order Id doesn't contain extra information

Customer Zipcode can be used instead of 'Customer State', 'Category Name', 'Customer City', 'Customer Country'

Order Region doesn't provide extra information

Department_Id = Department_Name and 'Category_Id' = 'Category_Name'

'Order_Weekday' = 'Day_Name' and 'Order_Month' = 'Month_Name'

Features

Categorical Encoding

Correlation

Dependent & Independent Variables

Train_Test_Split

Modelling & Evaluation






Predicting if an order is fraud or not

Categorical Encoding

Dependent & Independent Variables

Train_Test_Split

Modelling

Hyperparameter Tuning - RandomForestClassifier

Best parameters found: {'classifiern_estimators': 200, 'classifiermax_features': 'auto', 'classifier__criterion': 'entropy'}